Week 05
Data Cleaning and Probability Simulation

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-01-22

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Learning Objectives

By the end of this week, you will be able to:

  1. Clean and prepare raw data systematically
  2. Write tests and documentation for data cleaning
  3. Handle missing values appropriately
  4. Understand probability distributions
  5. Simulate data for testing and validation

Part 1: Data Cleaning Workflow

Why Data Cleaning Matters

The Reality of Real Data

Most data you encounter will be messy. Data cleaning often takes 80% of analysis time but receives far less attention than modelling.

Data cleaning is not just about fixing errors—it’s about:

  • Understanding your data deeply
  • Ensuring reproducibility
  • Building trust in your analysis
  • Documenting your decisions

The Data Science Workflow

Remember our workflow from Week 1:

\[\text{Plan} \rightarrow \text{Simulate} \rightarrow \text{Acquire} \rightarrow \text{Explore} \rightarrow \text{Share}\]

Data cleaning sits between Acquire and Explore, but planning for it begins much earlier.

Step 1: Save the Original Data

Golden Rule

Always save the original, unedited data in a separate folder and never modify it.

Why?

  • Reproducibility: Others can verify your work
  • Flexibility: You might change cleaning decisions later
  • Provenance: You can trace what happened to the data
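In practice this can be a single read-once, write-once step at the top of your script. A minimal sketch (the folder layout and file names are illustrative, not from the course materials):

```r
library(readr)

# Read the raw file once; this object is never overwritten
# ("data/raw/" and the file names here are invented examples)
raw_data <- read_csv("data/raw/survey_2026.csv")

# All cleaning happens on a copy, which is saved to a separate folder
cleaned_data <- raw_data   # cleaning steps go here
write_csv(cleaned_data, "data/cleaned/survey_2026_cleaned.csv")
```

Keeping `data/raw/` read-only means any cleaning decision can be revisited later by re-running the script.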

Step 2: Plan Your Endpoint

Before cleaning, sketch the dataset you want to end up with:

# Example: Planning a dataset of Australian state populations
set.seed(2024)
simulated_population <- tibble(
  state = c("New South Wales", "Victoria", "Queensland", 
            "Western Australia", "South Australia", "Tasmania",
            "Northern Territory", "Australian Capital Territory"),
  population = runif(n = 8, min = 0.2, max = 8) |> round(digits = 2)
)
simulated_population
# A tibble: 8 × 2
  state                        population
  <chr>                             <dbl>
1 New South Wales                    6.73
2 Victoria                           2.7 
3 Queensland                         5.51
4 Western Australia                  5.65
5 South Australia                    3.76
6 Tasmania                           5.67
7 Northern Territory                 3.44
8 Australian Capital Territory       2.57

Planning Forces Decisions

By simulating first, you must decide:

  • Column names: Full names or abbreviations?
  • Units: Population in millions or raw numbers?
  • Data types: Character, numeric, factor?
  • Valid ranges: What values are reasonable?

Tidy Data Principles

Plan for “tidy data”: each variable in its own column, each observation in its own row, each value in its own cell.
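As a small illustration of these principles (the numbers are invented), a "wide" table with one column per year hides the variable year in the column names; pivot_longer() tidies it into one row per state-year observation:

```r
library(tidyverse)

# Untidy: the variable "year" is spread across the column names
wide <- tibble(
  state  = c("NSW", "VIC"),
  `2021` = c(8.1, 6.5),
  `2022` = c(8.2, 6.6)
)

# Tidy: each observation (state-year) gets its own row
wide |>
  pivot_longer(cols = c(`2021`, `2022`),
               names_to = "year", values_to = "population")
```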

Step 3: Start Small

When working with messy data, get it into a rectangular format as quickly as possible.

# Example: Messy string data
unedited_data <- c("NSW 8.2 VIC 6.5 QLD 5.1")

# Convert to tidy format
tidy_data <- tibble(raw = unedited_data) |>
  separate_longer_delim(raw, delim = " ") |>
  mutate(type = rep(c("state", "population"), length.out = n())) |>
  mutate(row = rep(1:(n()/2), each = 2)) |>
  pivot_wider(names_from = type, values_from = raw) |>
  select(-row) |>
  mutate(population = as.numeric(population))

tidy_data
# A tibble: 3 × 2
  state population
  <chr>      <dbl>
1 NSW          8.2
2 VIC          6.5
3 QLD          5.1

Step 4: Write Tests

Tests Are Not Optional

Tests ensure your cleaning code does what you think it does, and they catch problems early.

Basic test structure using stopifnot():

# Test that all populations are positive
stopifnot(all(tidy_data$population > 0))

# Test that we have the expected number of states
stopifnot(nrow(tidy_data) == 3)

# Test that population is numeric
stopifnot(is.numeric(tidy_data$population))

Using the testthat Package

library(testthat)

# More readable tests
expect_equal(nrow(tidy_data), 3)
expect_true(all(tidy_data$population > 0))
expect_type(tidy_data$population, "double")

If tests pass, nothing happens. If they fail, execution stops with an informative error.

What to Test?

Validity Tests

  • Correct data types (class)
  • Values within expected ranges
  • No impossible values
  • Correct number of observations

Consistency Tests

  • Internal: Do columns sum correctly?
  • External: Do values match known facts?
  • Temporal: Are dates in order?
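The consistency checks above can be written as tests too. A sketch using invented household-budget data:

```r
library(tibble)

# Invented example data
accounts <- tibble(
  income     = c(50, 60),
  spending   = c(30, 20),
  savings    = c(20, 40),
  start_date = as.Date(c("2024-01-01", "2024-02-01")),
  end_date   = as.Date(c("2024-06-30", "2024-07-31"))
)

# Internal consistency: components should sum to the total
stopifnot(all(accounts$spending + accounts$savings == accounts$income))

# Temporal consistency: start dates should precede end dates
stopifnot(all(accounts$start_date < accounts$end_date))
```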

Step 5: Iterate and Generalise

Data cleaning is iterative:

  1. Plan → 2. Clean a sample → 3. Test → 4. Refine → 5. Scale up

# Example: Fixing common text issues
messy_names <- c("Patricia", "Ptricia", "PatricIa", "Patric1a", "Patricia ")

cleaned_names <- messy_names |>
  str_to_title() |>           # Fix capitalisation
  str_replace_all("1", "i") |> # Fix OCR errors (1 → i)
  str_trim()                   # Remove leading/trailing spaces

tibble(original = messy_names, cleaned = cleaned_names)
# A tibble: 5 × 2
  original    cleaned 
  <chr>       <chr>   
1 "Patricia"  Patricia
2 "Ptricia"   Ptricia 
3 "PatricIa"  Patricia
4 "Patric1a"  Patricia
5 "Patricia " Patricia

Part 2: Checking and Testing

Three Tools for Finding Anomalies

1. Graphs

Visualise distributions and relationships to spot outliers

2. Counts

Identify rare values and common errors

3. Tests

Programmatically verify expectations

Using Graphs to Find Problems

# Simulated youth survey with an error
youth_survey_data <- tibble(
  ages = c(15.9, 14.9, 16.6, 15.8, 16.7, 17.9, 12.6, 11.5, 16.2, 19.5, 150)
)

youth_survey_data |>
  ggplot(aes(x = ages)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() +
  labs(x = "Age of respondent", y = "Count", 
       title = "Spot the problem!")

Using Counts to Prioritise

# Where to focus cleaning efforts?
country_names <- tibble(
  country = c("Australia", "Australie", "Australia", "Austrelia",
              "Australia", "australia", "Australia", "Australie")
)

country_names |>
  count(country, sort = TRUE)
# A tibble: 4 × 2
  country       n
  <chr>     <int>
1 Australia     4
2 Australie     2
3 Austrelia     1
4 australia     1

Focus on fixing “Australie” first—it gives the biggest improvement!

Testing Class Is Critical

Class Affects Everything

Getting the class of variables wrong can completely change your analysis results!

# Same data, different classes → different results
data <- tibble(
  response = c(2, 1, 4, 3, 9, 8),
  group_numeric = c(1, 1, 2, 2, 3, 3),
  group_factor = factor(c(1, 1, 2, 2, 3, 3))
)

# With three or more groups, these give DIFFERENT regression results!
# lm(response ~ group_numeric, data = data)  # One slope: assumes a linear trend across groups
# lm(response ~ group_factor, data = data)   # A separate effect for each group

(With only two groups the two fits coincide; the distinction bites once there are three or more levels.)

Specifying Column Types

Be Explicit About Types

Don’t let R guess—specify column types when reading data.

# Instead of this:
data <- read_csv("myfile.csv")

# Do this:
data <- read_csv(
  "myfile.csv",
  col_types = cols(
    name = col_character(),
    age = col_integer(),
    income = col_double(),
    state = col_factor()
  )
)

Testing Dates

Dates deserve special attention because they often go wrong:

library(lubridate)

# Common date formats
dates_raw <- c("2024-03-15", "15/03/2024", "March 15, 2024")

# Parse with appropriate functions
ymd("2024-03-15")
[1] "2024-03-15"
dmy("15/03/2024")
[1] "2024-03-15"
mdy("March 15, 2024")
[1] "2024-03-15"

Always use ISO 8601 format (YYYY-MM-DD) in your cleaned data!
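When a single column mixes several formats, lubridate's parse_date_time() can try a list of candidate orders (the vector below reuses the formats shown above):

```r
library(lubridate)

mixed_dates <- c("2024-03-15", "15/03/2024", "March 15, 2024")

# Try each order in turn until one parses; results print in ISO 8601
parse_date_time(mixed_dates, orders = c("ymd", "dmy", "mdy"))
```

This is convenient for first-pass cleaning, but checking the result against a few known dates is wise, since ambiguous strings (e.g. 03/04/2024) can match more than one order.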

The clean_names() Function

library(janitor)

# Messy column names
messy_data <- tibble(
  `First Name` = "Alice",
  `weird#symbol` = 42,
  `InCoNsIsTeNtCaPs` = TRUE
)

messy_data |>
  clean_names()
# A tibble: 1 × 3
  first_name weird_number_symbol in_co_ns_is_te_nt_ca_ps
  <chr>                    <dbl> <lgl>                  
1 Alice                       42 TRUE                   

Part 3: Probability Foundations

Why Probability?

Probability helps us:

  1. Model variation in the world
  2. Understand sampling distributions of estimates
  3. Represent uncertainty in predictions
  4. Train our intuitions about randomness

The Normal Distribution

Central Limit Theorem

The sum of many small, independent random variables approximately follows a normal distribution.

tibble(x = seq(-4, 4, 0.01)) |>
  mutate(density = dnorm(x)) |>
  ggplot(aes(x = x, y = density)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", colour = "blue") +
  annotate("text", x = 0, y = 0.2, label = "68%", size = 5) +
  theme_minimal() +
  labs(title = "Standard Normal Distribution", 
       subtitle = "68% of values fall within 1 SD of the mean")

Key Properties of the Normal

For a normal distribution with mean μ and standard deviation σ:

  • 50% of values fall within ±0.67σ of μ
  • 68% of values fall within ±1σ of μ
  • 95% of values fall within ±2σ of μ
  • 99.7% of values fall within ±3σ of μ

# Example: Heights of Australian women (mean = 162cm, SD = 7cm)
qnorm(c(0.025, 0.975), mean = 162, sd = 7)  # 95% interval
[1] 148.2803 175.7197

Other Important Distributions

Binomial Distribution

For counting successes in n independent trials, each with probability p.

# Probability of exactly 3 heads in 10 coin flips
dbinom(3, size = 10, prob = 0.5)
[1] 0.1171875

Poisson Distribution

For counting events that occur at a constant average rate.

# Probability of exactly 5 emails per hour (avg = 3)
dpois(5, lambda = 3)
[1] 0.1008188

Mean, Variance, and Standard Deviation

For any distribution:

  • Mean (μ): The average value (centre of the distribution)
  • Variance (σ²): Average squared distance from the mean
  • Standard Deviation (σ): Square root of variance (same units as data)

# Sample calculations
heights <- c(165, 170, 155, 180, 160, 175, 168)
mean(heights)
[1] 167.5714
var(heights)
[1] 72.95238
sd(heights)
[1] 8.541217

Part 4: Simulation in R

Why Simulate?

Simulation is powerful because it lets us:

  1. Understand distributions by generating data from them
  2. Test our methods on data where we know the truth
  3. Propagate uncertainty through complex calculations
  4. Validate cleaning procedures by creating test cases

Simulating from Distributions

set.seed(2024)  # For reproducibility

# Simulate 1000 values from different distributions
sims <- tibble(
  normal = rnorm(1000, mean = 5, sd = 2),
  uniform = runif(1000, min = 0, max = 10),
  poisson = rpois(1000, lambda = 5)
)

sims |>
  pivot_longer(everything(), names_to = "distribution", values_to = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~distribution, scales = "free") +
  theme_minimal()

Key R Functions for Simulation

Distribution   Random      Density     Quantile
Normal         rnorm()     dnorm()     qnorm()
Binomial       rbinom()    dbinom()    qbinom()
Uniform        runif()     dunif()     qunif()
Poisson        rpois()     dpois()     qpois()

  • r*() generates random values
  • d*() gives probability density/mass
  • q*() gives quantiles (inverse CDF)
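The three families fit together; for the standard normal (the values in the comments are well-known constants):

```r
# q*(): the 97.5th percentile of the standard normal
qnorm(0.975)          # ≈ 1.96

# d*(): the density at the mean
dnorm(0)              # ≈ 0.399

# r*(): with a seed, random draws are reproducible
set.seed(2024)
rnorm(3, mean = 0, sd = 1)
```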

Example: How Many Girls in 400 Births?

# Probability of a girl is approximately 48.8%
set.seed(2024)
n_sims <- 1000

n_girls <- rbinom(n_sims, size = 400, prob = 0.488)

tibble(n_girls = n_girls) |>
  ggplot(aes(x = n_girls)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  geom_vline(xintercept = 400 * 0.488, colour = "red", linewidth = 1) +
  theme_minimal() +
  labs(x = "Number of girls", y = "Frequency",
       title = "Simulated number of girls in 400 births")

Using Simulation for Validation

Fake-Data Simulation

Simulate data from your model, then check that your analysis can recover the true parameters.

# Simulate data where we KNOW the true relationship
set.seed(2024)
n <- 100
true_intercept <- 5
true_slope <- 2
true_sd <- 1

fake_data <- tibble(
  x = runif(n, 0, 10),
  y = true_intercept + true_slope * x + rnorm(n, 0, true_sd)
)

# Fit model and check if we recover the truth
model <- lm(y ~ x, data = fake_data)
coef(model)  # Should be close to 5 and 2
(Intercept)           x 
   4.867021    2.019860 

Simulation for Data Cleaning Tests

Use simulation to create test datasets with known properties:

# Simulate a "clean" dataset
set.seed(2024)
clean_ages <- tibble(
  age = round(rnorm(100, mean = 45, sd = 15))
) |>
  mutate(age = pmax(18, pmin(age, 100)))  # Bound between 18-100

# Now introduce some realistic errors
messy_ages <- clean_ages |>
  mutate(
    age = case_when(
      row_number() == 5 ~ 450,      # Typo: extra digit
      row_number() == 10 ~ -25,     # Sign error
      row_number() == 15 ~ NA_real_, # Missing value
      TRUE ~ age
    )
  )

# Test your cleaning code on this!

The Power of replicate()

# Simulate the sampling distribution of the mean
simulate_mean <- function(n, mu, sigma) {
  sample_data <- rnorm(n, mean = mu, sd = sigma)
  mean(sample_data)
}

# Run 1000 simulations
means <- replicate(1000, simulate_mean(n = 30, mu = 100, sigma = 15))

# The sampling distribution
cat("Mean of means:", round(mean(means), 2), "\n")
Mean of means: 99.97 
cat("SE of mean:", round(sd(means), 2), "\n")
SE of mean: 2.71 
cat("Theoretical SE:", round(15/sqrt(30), 2))
Theoretical SE: 2.74

Part 5: Statistical Inference Basics

The Sampling Distribution

Key Concept

The sampling distribution is the distribution of a statistic (like the mean) across many hypothetical samples from the same population.

It arises from:

  • Random sampling from a population
  • Measurement error in observations
  • Model error from imperfect models

Estimates and Standard Errors

For any estimate:

  • Point estimate: Our best single guess (e.g., sample mean)
  • Standard error: The standard deviation of the sampling distribution

# Standard error of a proportion
n <- 1000  # Sample size
p_hat <- 0.55  # Observed proportion

# Standard error formula for proportions
se <- sqrt(p_hat * (1 - p_hat) / n)
cat("Estimate:", p_hat, "\nStandard error:", round(se, 4))
Estimate: 0.55 
Standard error: 0.0157

Confidence Intervals

A 95% confidence interval:

\[\text{estimate} \pm 2 \times \text{standard error}\]

# 95% CI for the proportion
ci_lower <- p_hat - 2 * se
ci_upper <- p_hat + 2 * se
cat("95% CI: [", round(ci_lower, 3), ",", round(ci_upper, 3), "]")
95% CI: [ 0.519 , 0.581 ]

Interpretation

If we repeated the study many times, 95% of the confidence intervals would contain the true parameter.
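This interpretation can itself be checked by simulation: draw many samples from a population with a known proportion, build a confidence interval from each, and count how often the truth is captured. A sketch (the sample size and true proportion are invented):

```r
set.seed(2024)
true_p <- 0.55
n <- 1000

covered <- replicate(2000, {
  # One simulated survey of n respondents
  p_hat <- mean(rbinom(n, size = 1, prob = true_p))
  se <- sqrt(p_hat * (1 - p_hat) / n)
  # Does the interval estimate ± 2 SE contain the truth?
  (p_hat - 2 * se) <= true_p & true_p <= (p_hat + 2 * se)
})

mean(covered)  # Should be close to 0.95
```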

Standard Error for Comparisons

For comparing two independent groups:

\[\text{SE}_{\text{difference}} = \sqrt{\text{SE}_1^2 + \text{SE}_2^2}\]

# Example: Gender gap in survey
n_men <- 400; p_men <- 0.57
n_women <- 600; p_women <- 0.45

se_men <- sqrt(p_men * (1 - p_men) / n_men)
se_women <- sqrt(p_women * (1 - p_women) / n_women)

difference <- p_men - p_women
se_diff <- sqrt(se_men^2 + se_women^2)

cat("Difference:", difference, "\nSE:", round(se_diff, 3))
Difference: 0.12 
SE: 0.032

Problems with Statistical Significance

Common Misinterpretations

  1. Statistical significance ≠ practical importance
  2. Non-significance ≠ no effect
  3. The difference between “significant” and “not significant” is not itself statistically significant
  4. P-hacking inflates false positive rates

Better approach: Report estimates with uncertainty intervals and interpret substantively.

Part 6: Putting It Together

The Simulation-First Workflow

  1. Plan your desired dataset structure
  2. Simulate fake data with known properties
  3. Write tests based on the simulation
  4. Acquire real data
  5. Clean using your tested procedures
  6. Validate by checking tests pass

Example: Running Times Data

# Step 1: Simulate expected data
set.seed(2024)
n <- 200
expected_ratio <- 8.4  # Marathon ≈ 8.4 × 5km time

sim_run_data <- tibble(
  five_km = runif(n, min = 15, max = 30),  # 15-30 minutes
  marathon = five_km * expected_ratio + rnorm(n, 0, 20)
) |>
  mutate(across(everything(), ~round(.x, 1)))

# Step 2: Write tests
stopifnot(
  all(sim_run_data$five_km >= 15),
  all(sim_run_data$five_km <= 30),
  all(sim_run_data$marathon >= 100),  # No sub-100 min marathons!
  all(sim_run_data$marathon <= 350)
)
Error: all(sim_run_data$marathon >= 100) is not TRUE
This failure is the test doing its job: with noise of SD 20 minutes, a fast 5km time can occasionally simulate to a marathon under 100 minutes, so either the simulation or the test bounds needs refining before moving to real data.

Visualising the Simulated Data

sim_run_data |>
  ggplot(aes(x = five_km, y = marathon)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  theme_minimal() +
  labs(x = "5km time (minutes)", y = "Marathon time (minutes)",
       title = "Simulated Running Times",
       subtitle = "Expected relationship: marathon ≈ 8.4 × 5km time")

Summary: Key Functions

Task            R Functions
Data cleaning   mutate(), filter(), select(), str_replace(), clean_names()
Testing         stopifnot(), expect_equal(), expect_true()
Simulation      rnorm(), rbinom(), runif(), sample(), replicate()
Summaries       mean(), sd(), var(), quantile()
Dates           ymd(), dmy(), mdy() from lubridate

This Week’s Readings

TSwD (Alexander, 2023)

  • Chapter 9: Clean and prepare (9.2-9.7)

ROS (Gelman, Hill & Vehtari, 2021)

  • Chapter 3: Some basic methods in mathematics and probability
  • Chapter 4: Statistical inference
  • Chapter 5: Simulation

Next Week

Week 6: Simple Linear Regression

  • Fitting and interpreting linear models
  • The least squares criterion
  • Making predictions
  • Using simulation to understand regression

Questions?

Lab This Week

Practice data cleaning and simulation with real Australian census data.